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Blooms of the dinitrogen-fixing marine cyanobacterium Trichodesmium considerably contribute to new 
nitrogen inputs into tropical oceans. Intriguingly, only 60% of the Trichodesmium erythraeum IMS 101 
genome sequence codes for protein, compared with —85% in other sequenced cyanobacterial genomes. The 
extensive non- coding genome fraction suggests space for an unusually high number of unidentified, 
potentially regulatory non-protein-coding RNAs (ncRNAs). To identifj^ the transcribed fraction of the 
genome, here we present a genome-wide map of transcriptional start sites (TSS) at single nucleotide 
resolution, revealing the activity of 6,080 promoters. We demonstrate that T. erythraeum has the highest 
number of actively splicing group II introns and the highest percentage of TSS yielding ncRNAs of any 
bacterium examined to date. We identified a highly transcribed retroelement that serves as template repeat 
for the targeted mutation of at least 12 different genes by mutagenic homing. Our findings explain the 
non- coding portion of the T. erythraeum genome by the transcription of an unusually high number of 
non-coding transcripts in addition to the known high incidence of transposable elements. We conclude that 
riboregulation and RNA maturation-dependent processes constitute a major part of the Trichodesmium 
regulatory apparatus. 



The diazotrophic cyanobacterium Trichodesmium contributes significantly to the nitrogen (N) and carbon 
cycles of tropical and subtropical oceans where it forms large surface accumulations estimated to supply 60- 
80 Tg N y"^ of 100-200 Tg N y"^ assessed from total marine N2 fixation\ 
Extensive research on Trichodesmium demonstrates its unique genetic, cellular, and physiological character- 
istics related to N2 fixation, nutrient acquisition and assimilation^ Among the filamentous non-heterocystous 
diazotrophs of the genus Oscillatoria^, Trichodesmium is unique because it fixes N2 during the day^ It separates 
this process from photosynthetic oxygen evolution spatially via diazocytes where nitrogenase is localized^"^^, and 
by downregulation of PSII activity when N2 fixation is high^. 

Compared with other cyanobacteria, Trichodesmium erythraeum IMSlOl (T. erythraeum) possesses a rela- 
tively large^^ single chromosome of 7,750,108 nt containing 4,451 protein-coding genes (accession number 
NC_008312). A distinctive feature is its 40% non-protein-coding portion. The cyanobacterial average is only 
15%^, raising the question whether T. erythraeum may have an unusually high number of unidentified non- 
coding RNAs (ncRNAs), which frequently have a regulatory function 

The genome also harbors a large number of selfish DNA elements such as insertion sequence (IS) elements and 
transposases, possibly contributing to genome expansion and adaptation. Published numbers differ from 106^^ to 
"more than 150"^^ IS elements and approximately 350 entries in the COG transposase categories^. Moreover, 
there are several genes interrupted by one or multiple introns, primarily group II introns. Group II introns, a 
purely bacterial intron class, are thought to be evolutionary ancestors of eukaryotic spliceosomal introns^^. Only 
—25% of eubacterial genomes harbor group II introns, usually only one or a small number^^'^^. A diversity 
generating retroelement (DGR) was predicted in T. erythraeum based on sequence similarity to such an element 
from the Bordetella bacteriophage BPP-1^^. DGRs introduce sequence diversity into a short, defined section of a 
protein- coding gene without interrupting it. The mechanism is based on the hypermutation of a variable region 
(VR) by recombination with a mutated cDNA copy. This copy is generated by the DGR-encoded reverse 
transcriptase (hereafter referred to as retron-RT) from an RNA template encoded in close vicinity to the ret- 
ron-RT^^"^°. This RNA template is transcribed from a distinct promoter and is called the template repeat (TR). 
The mutations are induced by the retron-RT during reverse transcription at adenosine residues in the TR- 
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RNA^^'^^. cDNA synthesis is primed by pairing of homologous sites at 
the y ends of the target- DNA/TR-RN A sequences^^. The mutated 
cDNA then recombines into the target (VR) sites, replacing the ori- 
ginal sequence by a process called mutagenic homing. Consequently, 
these target sequences are hypermutated at specific positions when 
the TR and retron-RT are both active. The documented VR sites are 
located in the 3' regions of the target genes and usually there are one 
or two such targets in a DGR-containing bacterial genome^°. The 
presence of a TR transcript is a hallmark of DGR elements, yet no 
such transcript has been identified in T. erythraeum to date. 

Transcriptomic analyses allow the detection of all transcripts 
including non- coding RNAs (ncRNAs), short unidentified open- 
reading-frames (ORFs), as well as intron splicing and expression of 
selfish DNA. In the differential RNA-seq approach (dRNA-seq^^), 
the RNA pool is enriched with primary transcripts, based on the 
presence of the characteristic 5' triphosphate that results from the 
initiation of transcription. dRNA-seq identifies the suite of active 
transcriptional start sites (TSS), providing a precise genome- wide 
map of active promoters and insight into the regulatory system. 

With this study we address the possible functionality of the extens- 
ive non-coding genome fraction of T. erythraeum. We present a 
genome-wide map of 6,080 TSS combined with extensive verification 
experiments. We demonstrate that at least 40% of all promoters 
active under standard lab culture conditions produce non-protein- 
coding RNAs and that these transcripts accumulate in much larger 
amounts than mRNA. We identified the DGR-associated TR tran- 
script and demonstrate that this hypermutation-generating system is 
functional in natural Trichodesmium populations. Based on the 
determined TR sequence, we identified 12 different genes that are 
possibly diversified by this DGR. Furthermore, we demonstrate 17 
actively splicing group II introns in 10 different host genes in this 
unusual bacterium. 

Results 

The primary transcriptome of T. erythraeum. As opposed to the 
widely used RNA-seq approach where total RNA is randomly 
sequenced, the applied dRNA-seq approach differentiates between 
processed or partially degraded RNA and nascent or non-processed 
(referred to from here as: primary) RNA. The primary RNA is 
enriched in the 5' starts of transcripts, allowing the mapping of 
TSS at single nucleotide resolution. In total, 56,911,484 primary T. 
erythraeum sequence reads up to 100 nt long were obtained from 
duplicate laboratory cultures of T. erythraeum under standard 
growth conditions using Illumina sequencing. These were mapped 
to the chromosome and the TSS were inferred following the criteria 
presented in the methods section. According to their location in the 
genome, TSS were further classified into 4 groups: gTSS drive the 
expression of annotated genes (including rRNA, tRNA, mRNA), 
allowing 5 '-untranslated-regions (5'UTRs) of up to 200 nt; iTSS 
and aTSS are located internally or antisense to an annotated gene, 
driving the expression of internal and antisense RNAs (asRNA, a 
specific type of ncRNA), respectively; nTSS are all TSS located 
within intergenic regions and >200 nt from the start of a gene and 
transcribe ncRNAs. Figure 1 shows the distribution of all four types 
of TSS along the genome (Fig. lA) and that nTSS were, at 26.8%, 
almost as abundant as gTSS at 30% (Fig. IB). 14% of all TSS were 
aTSS (Fig. IB), driving the transcription of asRNAs for almost 15% of 
all annotated genes in T. erythraeum. The full list of 6,080 identified 
TSS, together with their expression strength, is presented in 
Supplementary Table SI, extending the available information on 
promoters or RNA 5' ends in T. erythraeum^^'^^ by three orders of 
magnitude. We provide a visualization of all mapped reads alongside 
the genome including all identified TSS, ncRNAs, and asRNAs as 
Supplementary information (T. erythraeum transcriptome plot), 
searchable with T. erythraeum locus tags, gene names, or by 
genome position. 



An example of the data obtained and how it has been interpreted is 
shown in Fig. 2 for the sufBCDS gene cassette (Tery_4355- 
Tery_4358), encoding the Fe-S cluster assembly protein complex. 
The order of these genes is well conserved in bacteria and the opero- 
nic transcriptional organization has been the subject of several stud- 
ies^^'^^. Although the gene arrangement in T. erythraeum implies a 
functional unit, it is interrupted by two long IGRs (intergenic 
regions) between sufB and sufC, and between sufD and sufS. Our 
mapping data determined that both of these long IGRs possess active 
promoters and TSS (Fig. 2, red graph). We conclude that the first 
gene sw/B, the middle two genes sufCD, and the last gene sufS of the 
sufBCDS gene cassette can, under certain conditions, be transcribed 
independently from each other. This example illustrates one aspect 
that can be followed by these data at the genome-wide scale. 

The 20 transcripts associated with the highest read counts are 
presented in Table 1. These 20 transcripts correspond to TSS ranked 
1-24 in Supplementary Table SI, because the TSS on ranks 2-6 all 
belong to one very abundant repeat- derived transcript. Nine of these 
top 20 transcripts are ncRNAs. These nine ncRNAs alone recruited 
55% of all TSS-associated primary reads. Among them is the pro- 
moter of one of the two rRNA operons and the RNA component of 
RNase P, a ribozyme involved in the maturation of tRNAs^^. All other 
ncRNAs in the top 20 have unknown function. Please note that the 
rRNA operon promoter appears only on rank 13. Whereas mature 
rRNA is by far the most abundant RNA in the cell, there are ncRNAs 
whose nascent form is much more abundant (Table 1). The most 
abundant mRNAs observed among the top 20 transcripts code for 
two copies of RNA binding proteins, RNP-la and RNP-lb {Tery_ 
4702 and Tery_4701), two copies of the psaK gene {Tery_3921, 
Tery_4846, photosystem I reaction center subunit X), an unusually 
GC-rich mRNA {Tery_3486, 60% GC), a DUF4327 family protein 
{Tery_2891) widely conserved in cyanobacteria, a transcriptional 
regulator of the Abr family {Tery_0493), two hypothetical proteins 
unique to T. erythraeum {Tery_38 15 and Tery_2341), and Tery_0600, 
a protein with remote similarity to a kinesin family- like protein from 
the brown alga Ectocarpus siliculosus. High read numbers were also 
associated with genes encoding photosystem I proteins or enzymes 
involved in tetrapyrrole metabolism such as heme oxygenase 
{Tery_0335) or coproporphyrinogen oxidase {Tery_1166). 

To identify possible unknown protein-coding genes we selected 
the initial set of 1,621 nTSS and 855 aTSS and searched the 300 nt 
following each of these TSS by blastX against Genbank for matches to 
protein-coding genes and evaluated the results with the program 
RNAcode^^. We identified 10 additional genes and 11 pseudogenes 
including 4 remnants of transposases (Supplementary Table S2). 
Furthermore, this analysis led to the reannotation of Tery_0573 by 
adding 69 amino acids to the N terminus of this large outer mem- 
brane protein and to the annotation of the 5' exon of an intron- 
interrupted phage-related lysozyme gene (compare also Table 2). 
Among the 10 novel genes are petN encoding the cytochrome b6-f 
complex subunit PetN, and psaM and psbM encoding the photosys- 
tem I reaction center subunit XII and photosystem II reaction center 
protein M, respectively. Another newly identified gene, from which 
one of the most abundant mRNAs in the cell was transcribed 
(Table 1), encodes a hypothetical protein similar to Sta7437_4063 
of Stanieria cyanosphaera PCC 7437. 

Expressed non-coding repeats and other non-coding transcripts 
in T. erythraeum. The most abundant RNAs (Table 1, ncf 104733 
and following) originated from a > 6,000 bp long tandem repeat 
array on the forward genome strand consisting of 7 repeats, 
736 bp to 973 bp long (Fig. 3A). The ncRNAs produced from this 
repeat array alone accounted for 44.7% of all TSS-associated primary 
reads. To verify the high expression of these repeats, northern 
hybridizations were performed (Fig. 3B). These confirmed their 
expression: Transcription of repeats 1-6 starts at base 310 of the 
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Figure 1 | Genome-wide distribution of TSS in T. erythraeum. (A) Distribution of all TSS > 800 reads along a linear plot of the chromosome (upper 
part, forward strand; lower part, reverse strand). The classification of each TSS is color-coded: annotated genes (gTSS, gray), gene-internally starting 
transcripts (iTSS, purple), non-coding transcripts from intergenic spacer regions (nTSS, green), antisense RNAs (aTSS, red). (B) Total counts of different 
TSS types compared with the portion of coding (CDS) and non-coding nucleotides in the genome (right). aTSS and nTSS both are non-protein-coding, 
iTSS may or may not code for protein. 



repeated sequence and the major detectable transcript is —260 nt. 
The transcript originating from repeat 0 is somewhat shorter. In view 
of their high expression, these ncRNAs may be functionally 
important. 

We also examined all other ncRNAs of unknown function within 
the top 20 expressed transcripts plus 3 additional strongly expressed 
ncRNAs (Table 1) in northern blot experiments (Fig. 3C) and 
included one of the newly identified ORFs (psbM, Fig. 3C.7) as an 
example of a highly expressed nTSS re-classified during the course of 
this work as gTSS. All these transcripts were readily detected. 

A certain class of ncRNAs is associated with the CRISPR (clustered 
regularly interspaced short palindromic repeats) prokaryotic anti- 
viral defense system that is encoded in the majority of more complex 



cyanobacterial genomes^^. The 3.3 kb intergenic spacer between the 
genes Tery_1947 and Tery_1949 encompasses four 34 nt direct 
repeats, which are transcribed and processed and can be folded into 
an imperfect hairpin (Fig. 4). Therefore, both in length and in sec- 
ondary structure, these resemble typical CRISPR-associated repeat 
elements^^. These features are hallmarks of an expressed small 
repeat-spacer array in T. erythraeum, categorized as a CRISPR super- 
class D by automated repeat-based classification^^. However, we 
could not find any CRISPR-associated (cas) genes in the genome. 

Further analysis of non-coding transcripts in the primary tran- 
scriptome indicated that 115 (7%) of the nTSS and 4 of the aTSS drive 
expression of transposase pseudogenes and other IS elements 
(Supplementary Table S3). Applying ISfinder^^ to the non- coding 
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Figure 2 | Example of the transcriptome data obtained. dRNA-seq (top) and RNA-seq (middle) read coverage over the T. erythraeum genomic region 
6712800-6720000 containing the FeS- cluster- assembly gene cassette sufBCDS on the forward and repressor sufR on the reverse strand. Numbers on 
dotted lines are the primary read counts. This example illustrates how TSS can be depicted from the primary reads (black downward arrows) generated by 
dRNA-seq, whereas the RNA-seq reads offer information on general coverage over the length of a gene or operon. Black horizontal arrows on the right 
indicate the direction of transcription. In T. erythraeum, separate TSS of different expression strengths for sufB, sufCD, and sufS were found (a, b, c). 
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Table 1 | The top 20 expressed TSS plus three further highly expressed nTSS (ranking at positions 30, 40, and 56), which were also verified 
using a northern blot (Fig. 3C). The respective locus tag, the transcript product and type, comments from the annotator, start position and 
orientation (f, forward; r, reverse strand) and the average read count after normalization is provided. The nTSS read counts are identical for 
ncfl 05575-ncfl 09467 due to their identical sequences, causing each read to be mapped only with 1 /5 to each of the 5 identical TSS. 
Consequently, the true read number for this particular sequence is 5 X 6,1 10,586 reads. HP, hypothetical protein 
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rnbpl a, RNA binding protein 
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3 '-3' overlap with ABC transporter 
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641 01 09r 
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ribosomal RNA 


nTSS 


200 nt upstream of major processing site 


4603568r 
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ncRNA 


nTSS 
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6950581 r 
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fraction of the genome, we determined that 10.3% of it consists of IS 
elements. Because 8.5% of all annotated genes encode transposases 
(Supplementary Table 84), these selfish DNA elements account for 
6.1% of the whole genome (i.e., 317,477 nt in IGRs plus 155,034 nt in 
genes). The fact that selfish DNA is abundant and transcribed in T. 



erythraeum indicated that we should examine the transcription of 
other types of mobile elements in the genome such as group II introns. 

Transcript maturation in T. erythraeum involves splicing of at 
least 17 group II introns. By allowing RNA-seq reads to be split 



Table 2 | Verified group II introns in T. erythraeum. The respective locus tag and host gene are provided, followed by information on the 
presence of a maturase gene in the intron (yes or no), the number of mapped split reads (raw counts, RNA-seq reads), and whether the 
intron has previously been predicted in the group II intron database^^ 



Locus tag 


Host gene 


Maturase 


Mapped split reads 


Predicted 


Tery 4799 


allophycocyanin B alpha subunit, apcD 


yes 


337 


yes 


Tery 4178 


phage related lysozyme (with peptidoglycan-binding domain) 


no 


21 


no 


Tery_2323 


cassette of gas vesicle proteins 


no 


37 


no 


Tery_3633 


phage related lysozyme 


no 


18 


no 


Tery_3633 


phage related lysozyme 


yes 


8 


yes 


Tery 3633 


phage related lysozyme 


possibly 


3 


no 


Tery 3304 


IMP dehydrogenase 


yes 


11 


no 


Tery 0433 


ribonucleotide reductase, nrdj 


yes 


2 


yes 


Tery_0433 


ribonucleotide reductase, nrdJ 


no 


10 


no 


Tery_0428 


ribonucleotide reductase, nrdJ 


yes 


10 


yes 


Tery_0008 


DNA polymerase III, dnaN 


no 


2 


no 


Tery_0008 


DNA polymerase III, dnaN 


no 


1 


no 


Tery_0008 


DNA polymerase III, dnaN 


no 


6 


no 


Tery_0008 


DNA polymerase III, dnaN 


no 


5 


no 


Tery 2080 


thymidylate synthase 


no 


3 


no 


Tery 1635 


putative metal dependent protease 


no 


3 


yes 


Tery 4732 


domain 1 of group II intron in RNase H 


yes 


2 


yes 
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Figure 3 | Accumulation of abundant non-coding transcripts. (A) dRNA-seq read counts over the genomic region 103000-1 10200 containing 7 tandem 
repeats, of which repeats 1-6 are >96% identical over their aligned length, repeat 6 is missing the terminal 20 bp and repeat 1 is missing the terminal 
130 bp. Repeat 0 is missing the first 228 bp, its aligned sequence is only —87% identical to repeats 1-6. The TSS at base 310 of the repeated sequence is a T 
in repeat 1, but a G in repeats 2-6, explaining the different read numbers mapped to these repeats (6,576,448 vs. 6,1 10,586 reads, on dotted lines), despite 
100% sequence identity for the following 216 bp. Orange boxes on the reverse strand show partial duplications of gene Tery_0080. Note that the 
transcription of the repeats occurs from the forward strand. (B) Northern blot analysis of repeat 0 compared with repeats 1-6. Long precursors were not 
observed in either of the hybridizations. The sizes of the individual bands were calculated from the original marker on the corresponding membrane with 
ECapt 15.06 and are prone to an error of approximately 10% (true for all following northern blots). (C) Northern blots for all other unknown 
ncRNAs from the top 20 (Table 1) including the ps^M transcript (panel 7) first classified as nTSS (rank 56 in Supplementary Table SI) and reclassified as 
gTSS after identification as a new ORF. The apcAB dicistronic mRNA (panel 6) has a very long 5'UTR, initially also misclassified as nTSS, although apcA 
has a second gTSS closer to the reading frame (black circle). The numbers underneath the images correspond to the panels below, showing the genomic 
location and size of each of the RNAs (red boxes, to scale). The largest RNA sizes observed in the blots were used, although occasionally they may be 
read-through due to imperfect termination (C.5 and C.7). Red graph: primary reads. Green graph: RNA-seq. Black double bars: peak rising out of scale. 
Black horizontal arrows: direction of transcription. For better visualization, only one marker lane is shown here and all images were scaled to fit this 
marker. Full size blots are available in the Supplementary Fig. S2. 



SCIENTIFIC REPORT: | 4 : 61 87 | DOI: 1 0.1 038/srep061 87 



5 



A 



I n — 




CRISPR pseudo 


pseudo 




fw 


n>n>n> n> i \> 






13019800 13020400 


13021000 [3021600 


[3022200 


[3022800 


rev <| 1 






<\ 


pseudo 






Tery 1949 


t 




IS 






Figure 4 | Remnant of a CRISPR repeat array. (A) The genomic location of the four CRSPR repeats (red boxes on the fv\^-strand) between the genes 
Tery_1947 and Tery_1949 with primary transcriptome coverage above and below. Due to the size cutoff for the lUumina library preparation (300- 
500 bp), a TSS was not detected for the CRISPR, but was for two adjacent pseudogenes (vertical arrows). The yellow bars below the CRISPR boxes indicate 
probe positions for the northern blot (B) that was performed to verify the expression of the repeat spacer array. Only probe 1 detected transcripts; the sizes 
correspond to a highly transcribed and processed CRISPR array. A precursor was also detected, albeit at very low abundance (420 nt). Full size blots are 
available in the Supplementary Fig. S2. (C) Secondary structure prediction of the first two repeats with their internal spacer. The repeats form the hairpins 
characteristic of CRISPR. 



during mapping to the genome (Fig. 5A), we obtained direct evidence 
for the in vivo splicing of 17 group II introns, some of which have 
been previously described (http://webapps2.ucalgary.ca/~groupii/;^^'^^"^^). 
Most were merely predicted, but had not had splicing demonstrated 
in T. erythraeum. These evidently active introns are located in the 
genes coding for the allophycocyanin B (AP-B) alpha subunit 
(apcD), an IMP dehydrogenase, a thymidylate synthase, the gas vesi- 
cle protein A, two different phage-related lysozymes^^, a putative 
metal-dependent protease, a ribonucleotide reductase, and the 
DNA polymerase III (Table 2). Northern blot analysis confirmed 
the presence of splicing intermediates as well as premature and 
mature mRNA forms for apcD, as an example (Fig. 5B). 

A Diversity Generating Retroelement and its associated template 
and variable target repeats (TR and VRs). An ncRNA belonging to 
a highly expressed retroelement was detected, originating from the 
nTSS 2632789r and accumulating as an abundant 265 nt transcript 
(Fig. 6A). Based on sequence similarity to its likely target genes and 
due to its location downstream of a specific retron-RT gene (Fig. 6B), 
we identified this ncRNA as the TR-RNA of a previously predicted 
DGR^'. 

Unprecedented, the DGR of T. erythraeum appears to target at 
least 12 genes simultaneously (12 VR sites), possibly rewriting 17 
codons and thus altering up to 17 carboxy-terminal amino acids in 
each of the respective proteins. In accordance with the known muta- 
tional mechanism^^'^^, the alignment of all 12 VRs with the TR clearly 
demonstrates that diversity at a given nucleotide position occurs only 
where an adenosine occurs in the TR (Fig. 6C, grey arrows). For each 
VR, the possible sequence diversity is 4"^^^ or > 10""^^. However, in five 
of the VRs, random mutations at non-adenosine sites have accumu- 
lated (Fig. 6C, orange arrows and circles). 70% (21 of 30) of these 



mutations do not lead to amino acid changes, as they are at third 
codon positions or belong to a diversified codon. In the T. ery- 
thraeum genome sequence, the necessary retron-RT gene is pseudo- 
genized by a single point mutation generating a stop codon at 
position 39 (annotated as Tery _1 728 and Tery _1 729). We rese- 
quenced all VR regions, but no differences from the genome 
sequence were observed, consistent with the pseudogenization of 
the retron-RT gene in T erythraeum. The presence of an intact 
DGR system in natural populations and its inactivation in culture 
is discussed below. 

Discussion 

The transcriptome analysis of Trichodesmium revealed several, here- 
tofore, undescribed features. Using the example of the sufBCDS gene 
cassette, we determined that genes thought to be operonic actually 
have individual promoters in T. erythraeum. This finding is consist- 
ent with observations that SufB can form different complexes with 
other Suf proteins'" such as SufB(2)C(2) and SufBC(2)D. The differ- 
ent composition of these complexes may rely on the capability to 
express the corresponding genes in non- stoichiometric amounts. 
Such situations that allow the separate regulation of individual genes 
of an apparent operon are most likely more frequent than previously 
thought, illustrated also by the transcriptional organization of alkane 
biosynthesis genes in T. erythraeum and other cyanobacteria'^. 
Generally, the long intergenic spacers in the T. erythraeum genome 
allow for the evolution of promoter sequences that can increase 
regulatory and metabolic flexibility. 

We examined transcription of the coding and non-coding genome 
fraction of cultured T. erythraeum and determined that, of the 6,080 
mapped active TSS, 1,621 were nTSS (26.7%), producing mostly 
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Figure 5 | Mapping of split RNA-seq reads for the identification of introns in transcribed genes. (A) Active group II intron splicing in the gene apcD 
coding for allophycocyanin-B (AP-B). Transcription and splicing is evident from the transcriptome data. 337 split exon reads (top panel) clearly mark the 
start and end of the interrupting intron (between the red boxes shown in the genome detail). The dRNA-seq reads (red) mark the TSS of the 5' exon, as 
well as two internal TSS of unknown function; the primary read counts are on the dotted lines. White boxes: genes as annotated in the genome. Red boxes: 
5' exon and the missing part of the 3' exon identified here. Black double bars: peak rising out of scale. The position of the probe used for the northern blot 
is indicated by an orange bar. (B) Northern blot hybridization of precursor RNA, different splicing intermediates, and the mature, spliced mRNA. 
Note that signal strength corresponds well to coverage of the probe with the different RNAs. Full size blots are available in the Supplementary Fig. S2. 



unknown, expressed ncRNAs. This percentage is higher than in 
other cyanobacteria: 11.8% nTSS in Nodularia spumigena, 12.1% 
in Synechocystis sp. PCC6803 and 10.3% in Anahaena sp. 
PCC7120^^. Notably, our sequencing approach omitted all RNAs 
< 300 nt, therefore we did not even capture smaller putative 
ncRNAs. Additionally, using different culture conditions would 
most likely further increase the number of mapped nTSS. With a 
very conservative approach, calculating 300 nt as the minimal length 
of each RNA we sequenced, we determined that at least 18.2% of the 
non-coding space was transcribed into ncRNAs and 5'UTRs of pro- 
tein-coding mRNAs. We determined the median 5'UTR length for 
T. erythraeum to be 52 nt, almost twice as long as the median 5'UTR 
lengths, only 27 and 29 nt, determined for two strains of Pro- 
chlorococcus^^, but only slightly longer that the 43 nt in H. pylorf\ 
and in accordance with the 52 nt in Synechocystis PCC6803^\ 

In addition to the high incidence of nTSS in the T. erythraeum 
genome, we also discovered that ncRNAs have much higher express- 
ion levels than mRNAs (Supplementary Figure SI). The top 8 
ncRNAs alone, excluding the rRNA-operon and RNase P, accounted 
for 54% of all TSS-associated primary reads, whereas the top 8 



mRNAs accounted for only 9% of these reads (Table 1). Intri- 
guingly, the most highly expressed RNA in the entire dataset origi- 
nated from a >6 kb array consisting of 7 tandem repeats (Table 1, 
Fig. 3A,B). Each repeat contains a duplicated 3' part of Tery_0080, the 
gene immediately downstream from the array on the reverse strand. 
Yet, the remaining repeat sequence has no similarity in T. erythraeum 
or any other genome in Genbank. These long repeats may originate 
from duplication events involving mobile genetic elements that 
inserted alien sequence stretches along with duplication events. 

We identified several new small proteins that were missing in the 
T. erythraeum genome annotation. Among them were three genes 
encoding photosynthetic proteins and one encoding a hypothetical 
protein, which was among the top 20 transcribed mRNAs in the 
culture (Supplementary Table S2). Although a few novel ORFs were 
found, the majority of nTSS and all aTSS give rise to different types of 
ncRNAs rather than unannotated mRNAs. 

The high incidence of actively transcribed repeated sequence ele- 
ments, transposable elements, and introns is unusual among bacteria 
studied so far. It has been discussed that transposable elements 
could contribute to adaptation of a species to its environment^^"^^ 
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Figure 6 | A diversity generating retroelement (DGR) targeting 12 different protein-coding genes. (A) High expression of the DGR template repeat 
(TR) with a main transcript length of 265 nt and putative precursors. Full size blots are available in the Supplementary Fig. S2. (B) Location of the TR (red 
box) , two neighboring VR genes (with variable regions in light pink) , and the retron-type reverse transcriptase (retron-RT) on the genome. The retron-RT 
is interrupted by a stop codon. The red graphs show the primary reads in this region with the count on the dotted line, for the reverse strand only. 
(C) The multiple alignment shows the variable regions (VRs) of 12 protein-coding genes of T. erythraeum (including the 3'UTRofthese genes) compared 
against the nucleotide sequence of the TR. The black framed boxes below the alignment indicate the codons corresponding to the reading frames of all 12 
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through their function as natural genetic engineering systems, thus 
speeding up genome evolution. Therefore, Trichodesmiums high 
versatility and ecological success may, in part, be linked to these 
features. 

Since natural Trichodesmium populations can be associated with 
massive bacteriophage populations^^ it is not surprising that we iden- 



tified a short CRISPR-Cas repeat-spacer array. It exhibits a typical 
pattern for the processing of crRNAs (CRISPR RNAs) as well as 
active transcription (Fig. 4). However, T. erythraeum does not have 
any cas genes. Cyanobacterial CRISPR-cas arrays are prone to gen- 
etic rearrangements including their substitution among closely 
related strains^^ or loss^^. Therefore, this repeat- spacer array may 
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be a remnant from a previous CRISPR- cas cassette whose main part 
was lost from this region. Indicative of such a development could be 
the three pseudogenes present in the same intergenic region 
(Fig. 4A), as they are all remnants of transposases. 

An alternative means to escape phages may be via the diversity 
generating retroelement (DGR) we identified here. In Bordetella 
phage BPP-1, the DGR diversifies the gene specifying tropism for 
host receptor molecules^^. In Trichodesmium, the opposite situation, 
namely a phage resistance mechanism by means of, possibly mul- 
tiple, surface molecule diversification is easily conceivable. Although 
mechanistically different, protein diversification by gene mutations 
is an escape strategy from bacteriophage in the marine cyanobacter- 
ium Prochlorococcus^^ . We show that T. erythraeum has the potential 
for targeted hypermutation of the 3' sequence (i.e., the variable 
repeat DNA sequences, VRs) of 12 protein-coding genes. 

To date, the maximum number of VR- containing target genes 
reported in other bacteria is four^°. The targeted sequences in T. 
erythraeum encode a C-terminal DUF323 (FGE sulfatase) domain, 
which is considered to recognize specific oligopeptide motifs^°. This 
DUF323 domain is fused to different N terminal regions; mostly 
protein kinase domains. Moreover, we discovered a cysteine- 
dependent aspartate-directed protease (Caspase, pfam00656) 
domain in Tery_2077. Caspases are well known to mediate autoca- 
talytic programmed cell death (PCD), yet are not present in cyano- 
bacteria. Trichodesmium is one of very few bacteria for which 
PCD -like mechanisms have been described^^"^° and contains a large 
number of putative metacaspases^^'^^ some of which are actively 
expressed during programmed cell death induction in Trichodes- 
mium'^'^. However, the DGR system appears to have been inactivated 
during the time between isolation and sequencing of T. erythraeum 
(T erythraeum was isolated and brought into culture in 1993^^ and its 
genome sequence was available 1 1 years later). The retron-RT neces- 
sary for generating the mutated cDNA from the TR^^ is disrupted by 
a stop codon generated by a point mutation. Although the TR-RNA 
was highly expressed, resequencing of the VR sequences did not 
reveal a single mutation compared with the available genome 
sequence. This is not surprising because selective pressures from 
the environment are lost in culture. Strikingly, most (7 out of 12) 
VR sequences have no point mutations at non-adenosine positions, 
but are identical among each other and the TR, despite the fact that 
the 3"^^^ positions of most codons can mutate without altering amino 
acids. Four of the five VRs that differ from the TR at non-adenosine 
sites (Fig. 6C, orange arrows) also have point mutations in the 35 bp 
long 3' conserved sequence stretch (Fig. 6C, red horizontal bar), 
which is most likely necessary for homing^^. These VRs are most 
likely not targets for the TR any longer and have lost the ability to 
diversify independent of inactivation of the retron-RT. Most of these 
non-targeted point mutations are silent, indicating that the proteins 
harboring these VRs are likely still active, although they are not 
currently DGR targets. To underline our hypothesis that the T. ery- 
thraeum DGR system has been inactivated in culture, but is active in 
wild populations, we did a search for the TR and associated retron- 
RT in available metagenomic datasets. Indeed, a dataset from wild 
Trichodesmium colonies off the coast at Oahu, Hawaii (IMG Sub- 
mission ID 8735) includes a 9,925 nt-long scaffold (TrichMG- 
DRAFT_c 100472) with perfect co-linearity to the corresponding 
region in the T. erythraeum genome. It harbors an intact retron- 
RT matching both the Tery_1728 and Tery_1729 genes with 94% 
nucleotide identity, followed by the TR (95% identity). Further- 
more, this scaffold includes two VR-containing genes with a 94% 
DNA identity to Tery_1727 and 92% to Tery_173L Therefore, it is 
very likely that this DGR is active in natural populations. In the 
future, it will be interesting to investigate the actual activity and 
regulation of the DGR in wild Trichodesmium populations. Using 
single cell genomics it should be possible to estimate the diversifica- 
tion rate within a population. 



All transcriptome reads can be downloaded from the NCBI 
Sequence Read Archive (BioProject ID PRJNA237745). 

Methods 

Culture media and growth conditions. T. erythraeum IMS 101 cultures were grown 
in YBCII medium'^ at 25°C, 12 : 12 light/dark cycle at ~80 |amol photons m"^s"^ 
white Ught. For transcriptome analysis, two replicate batch cultures were grown for 9 
days in sterile 1 L-Pyrex bottles with gentle air bubbling and were harvested in the 
middle of the light phase. For Northern blot verification experiments, separate 
cultures were grown in the same conditions. 

Extraction, analysis, library preparation and sequencing of total RNA. RNA from 
T. erythraeum cultures was prepared as previously described^'' with the following 
modifications. Filters with cells were placed in tubes containing 2 ml RNA 
resuspension buffer (10 mM sodium acetate, 200 mM D( + )-sucrose, 100 mMNaCl, 
5 mM EDTA) and snap-frozen by immersion in liquid nitrogen. Phenolic PGTX 
buffer (2 ml)^^ was added to the frozen samples directly prior to extraction. The 
quality and quantity of total extracted RNA was analyzed on a Shimadzu MultiNA 
microchip electrophoresis system. 

Total RNA (3-5 |ig) from each of the two cultures was used for the preparation of a 
primary transcriptome cDNA library following the dRNA-seq protocol of Sharma et 
al.^^ with some modifications (vertis Biotechnologie AG, Germany). Total RNA was 
treated with Terminator- 5 'P-dependent-exonuclease (TEX) to digest all processed 
and partly degraded RNA and enrich 5'PPP-containing primary transcripts. 
Enzymatic treatments are often not 100% effective, which is why the treated RNA was 
split into two subsamples to allow for a negative control library containing only the 
processed RNA that was not removed by TEX. The first subsample was prepared as 
described (+tobacco-acid-5' -pyrophosphatase, 5' RNA-adapter ligation, rando- 
mized first- strand cDNA synthesis, 3' lUumina adapter ligation) to produce the 
primary transcriptome library (named +5'PPase from here). From the second sub- 
sample tobacco-acid-5' -pyrophosphatase treatment was omitted (named -5'PPase 
from here). In this negative control, primary transcripts with a triple-phosphate 
(5'PPP) could not be ligated to the 5' lllumina adapter, whereas processed transcripts 
that were not removed by the TEX treatment could be ligated. The first cDNA 
amplification (18-20 cycles) was followed by cDNA fragmentation and isolation of 
300-500 bp long fragments. After ligation of the 3' lllumina adapter, a final PGR (4 
cycles) was performed. 

For the preparation of the non-enriched transcriptome (RNA-seq, all transcripts 
including processed ones), a random primed (N6) cDNA library was prepared from 
untreated, fragmented total RNA (mixed from both cultures) and lllumina TruSeq 
adapters ligated to the 5' and 3 ' ends of the cDNA. This library was amplified with 14 
PGR cycles. All libraries were sequenced on an lllumina HiSeq 2000 machine using 
single-read sequencing and 100 bp read length. 

The dRNA-seq preparation has an advantage through the enrichment of primary 
transcripts (non-processed, recently transcribed RNAs) as opposed to the RNA-seq 
approach, which does not enrich for particular RNA, but gives a general overview of 
RNA present in the cells, without information on which promoters are active at the 
time of sampling. To clarify, the TEX treatment removes the bulk of ribosomal RNA 
(rRNA) because functional rRNA in the cell is processed from a polycistronic tran- 
script and thus contains a 5 'P. However, the nascent polycistronic rRNA transcripts 
are not removed; the resulting sequences represent the activity of the rRNA operon 
promoter. 

Sequence analysis and computational methods. The read quality was verified using 
FastQG^^ and 99% of the sequences had a quality score >= 20 (accuracy >= 99%). 
Our quality analysis revealed that very few reads had a sequencing quality below 20 at 
their 3' ends and all 5' ends were of very high quality. Accumulation of sequencing 
errors results in a shorter alignment, but the reads will still be mapped if the 
alignments reach a certain length. Because we were interested in the 5' positions of the 
mapped reads for TSS identification, no trimming was performed. Sequencing reads 
from all libraries were mapped to the T. erythraeum genome using Segemehl, version 
0.1.6^^, with default parameters. If a read had several best matches on the genome, it 
was counted only as the corresponding fraction for each of these matches (i.e., for 6 
identical repeats, a mapping read was counted 1/6 for each of the repeats). For the 
identification of splice sites, the RNA-seq reads were remapped using the -S option to 
obtain higher mapping scores by the splitting of reads^*^. 

The + 5'PPase and — 5'PPase libraries from dRNA-seq were normalized separately 
with the TMM normalization method^^ implemented in edgeR*^" using all genome 
positions with more than 2 read starts per million and excluding reads mapping to the 
rRNA operons. The resulting read counts were multiplied by 100. The RNA-seq reads 
were not normalized. 

For the initial TSS mapping, at a specific genome position, two conditions had to be 
fulfilled: 1 ) In both of the duplicate + 5 ' PPase libraries, >270 normalized reads had to 
start at this position, corresponding to 27 and 30 raw reads, respectively. 2) The ratio 
between these read starts and all reads covering the position had to be >0.5^\ 
Beginning with the strongest TSS (most reads) and in decreasing order, all reads from 
weaker TSS positions within ± 10 nt were assigned to this TSS and the weaker TSS 
were discarded. In a final step, the reads from the negative control library (reads from 
processed transcripts contaminating the dRNA-seq library, see library preparation) 
were used to discard false TSS. A TSS was considered to be more likely a processing 
site than real start site of transcription if the ratio between the number of reads in the 
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dRNA-seq library ( + SPPase) and the negative control library ( — SPPase) fell below a 
certain threshold. To define a reliable threshold below which a TSS would be dis- 
carded, a randomly selected set of 30 TSS was manually classified as true or false 
positive resulting in the following exclusion criteria. A TSS was declared "false", if one 
of the two ratios was < 1 (more reads in the negative control library than in the 
corresponding dRNA-seq library) and the added + SPPase/ — SPPase-ratios of a TSS 
in both libraries was below 37. The latter criterion ensured that a TSS with a ratio 
below the threshold in one condition, but with a very good ratio in the other con- 
dition, was retained. These criteria for TSS prediction were deliberately conservative 
(samples of just discarded or just predicted TSS were verified manually) possibly 
producing false-negative TSS. 

To test for statistical significance, the 6,080 predicted TSS were compared with a 
TSSAR'^ TSS annotation. For 5,542 (91.15%) of our predicted TSS, a TSS (p < 0.05, 
noise threshold =10) was also annotatedby TSS AR within ±10 nt. Furthermore, 249 
of the 538 unconfirmed TSS are located in regions that could not be modelled by the 
zero-inflated Poisson regression applied by TSSAR. The unmodelled regions are 
small islands with high expression levels and, from manual examination, our pre- 
dicted TSS appear to be true positives. 

For the detection of new short ORFs, up to 300 intergenic nucleotides after each 
nTSS (less, if a coding sequence (CDS) was encountered before extension by 300 nt) 
were searched against the NCBI database using blastn and blastx with e < le-5. The 
blastx results were manually analyzed. From the blastn results, multiple alignments 
were generated using ClustalW*^^. These alignments were then evaluated for their 
coding potential with RNAcode^^. 

For the quantification of IS element remnants in ICRs, a blastx search with a 
threshold of e < le-5 was performed for each IGR against the IS-finder database^^ and 
all nucleotides related to at least one blast hit were counted. The same parameters 
were used to identify nTSS (TSS in non-coding regions) and aTSS (TSS antisense to an 
annotated gene) driving pseudogenized transposases (TSS + 300 nt, but only until a 
CDS was reached). For the identification of IS elements among annotated genes, a 
blastP search with all CDS features and e < le-5 was performed against the IS-finder 
database. 

The Artemis 15.0.0 genome browser^^ was used for browsing transcriptome data 
along the T. erythraeum genome sequence. Secondary structures of RNA molecules 
were computed on the Mfold^^ webserver. The Vienna file format was exported and 
the final structure visualized using VARNA*^^. 

Northern blot analysis. Blots for the detection of specific highly expressed transcripts 
were prepared from the separation of 3-5 \ig of total RNA on 10% polyacrylamide 
urea gels or 1.5% agarose gels with 16% formaldehyde, depending on the expected 
transcript sizes (up to 1,000 nt and above 1,000 nt, respectively). Blotting and 
hybridization with [a-32P]UTP-labelled transcript probes was performed as 
previously described^*. For the as_repeatl-6 [Y-32P]ATP-labeled oligonucleotide 
probe, hybridization and washing temperatures were 45°C and 40°C, respectively. 
Transcript probes were produced from PCR-generated templates using in vitro 
transcription (MaxiScript kit, Ambion). All DNA oligonucleotides used for PCR and 
hybridization experiments were purchased from Sigma- Aldrich (Hamburg, 
Germany) or Life Technologies (Darmstadt, Germany) at the 'desalted' quality 
(Supplementary Table S5). Unless stated otherwise, working concentrations of all 
oligonucleotides were 10 |iM. 
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